Abstract

Background: For many applications one wishes to decide whether a certain set 
of numbers originates from an equiprobability distribution or whether they are 
unequally distributed. Distributions of relative frequencies may deviate significantly 
from the corresponding probability distributions due to finite sample effects. Hence, 
it is not trivial to discriminate between an equiprobability distribution and non- 
equally distributed probabilities when knowing only frequencies. 
Results: Based on analytical results we provide a software tool which helps to
decide whether data correspond to an equiprobability distribution. The tool is
available at http://bioinf.charite.de/equifreq/.

Conclusions: Its application is demonstrated for the distribution of point muta- 
tions in coding genes. 



Background 



Assume that a set of events occurs with frequencies $M_i$, $i = 1, \dots, N$,
with $\sum_{i=1}^{N} M_i = M$, e.g., $M_i = \{4, 5, 2, 3, 2, 9, 3, 3, 5, 12, 4, 6, 4, \dots\}$.
We ask the question whether the events obey an equiprobability distribution
$p_i = 1/N$. According to the general definition of probabilities

$$p_i = \lim_{M \to \infty} \frac{M_i}{M}, \qquad (1)$$

for an equiprobability distribution and for large sample size M it is expected
to find each of the events approximately $M_i = M/N$ times. For finite sample
size, however, the frequencies may deviate considerably from this value
(Fig. 1).

Fig. 1. Histogram of frequencies of M = 1000 events which have been drawn
according to an equiprobability distribution $p_i = 1/N = 1/100$. The dashed
line displays the expectation value.

Fig. 2. Same data as in Figure 1 but in rank order. From the figure it might be
erroneously concluded that the events do not obey an equidistribution. The
distribution is deformed, however, exclusively due to finite sample effects.

The deviation from the equidistribution becomes particularly obvious if we 
order the events according to their rank, i.e., the most frequently occurring 
event appears left at the abscissa, then the next frequent, etc. (Fig. 2). 

If we naively infer the probabilities from the observed frequencies, i.e.,
if we assume $p_i/p_j = M_i/M_j$, then in the extreme case $M_{100} = 3$ we end
up with a relative error of 70%. In other words, from the frequencies measured
in an experiment as shown in Figs. 1 and 2, it might be erroneously concluded
that the events are strongly non-equally distributed.
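
The finite sample effect is easy to reproduce numerically. The following
minimal sketch draws M = 1000 events from N = 100 equiprobable classes, puts
the counts into rank order and reports how far the rarest class lies below the
expectation M/N. The function name and the seed are illustrative choices and
not part of the tool described here.

```python
import random

def simulate_rank_ordered(N, M, seed=0):
    """Draw M events uniformly from N classes; return the counts sorted from
    the most frequent to the least frequent class."""
    rng = random.Random(seed)
    counts = [0] * N
    for _ in range(M):
        counts[rng.randrange(N)] += 1
    return sorted(counts, reverse=True)

N, M = 100, 1000
ranked = simulate_rank_ordered(N, M)
expected = M / N

# Relative deviation of the least frequent class from the expectation M/N
rel_error = (expected - ranked[-1]) / expected

print("most frequent:", ranked[0], "least frequent:", ranked[-1])
print("relative error of the rarest class: {:.0%}".format(rel_error))
```

Changing the seed changes the individual counts, but the spread around M/N
remains of comparable size as long as M/N is small.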

Using the methods of statistics we can generate (predict) the rank ordered
frequency distribution for given N and M under the precondition that the
events are equidistributed [1]. The predicted frequency distribution can then 
be compared with the distribution as measured in an experiment with the 
same values of M and N. From the comparison it can be judged whether the 
events in the experiment obey an equidistribution. 

Following this procedure we describe a tool which helps to decide whether a 
given set of frequencies complies with an equidistribution. For demonstration 
the tool is applied to the distribution of point mutations in human genes. 



Implementation 



The numerical tool is available via the web address
http://bioinf.charite.de/equifreq/. The underlying kernel program which
computes the most probable frequency distribution is implemented in C++ and
the user interface is written in PHP. The program source is available at this
address.



Results and Discussion 



Mathematical method 



We briefly sketch the derivation of the basic formula: Assume we distribute
M balls over N urns according to an equidistribution. The probability
$p(k_i, i)$ to find $k_i$ urns filled each with exactly i balls is given by

where $\lfloor x \rfloor$ denotes the integer part of x.

Note that the probability to find a number of $k_i$ urns which contain exactly
i balls is different from the probability to find the number of urns which
contain at least i balls. The latter is a simple textbook problem, whereas the
derivation of Eq. (2) requires quite involved algebra. The relation between
both probabilities is provided by the inclusion-exclusion principle [2,3]. For
our purpose we need the number $\langle K_i \rangle$ of urns filled with i
balls which are found on average, i.e., we need the first moments of the
probabilities Eq. (2). These values can be found in closed form by applying
the method of generating functions for the descending factorial moments. The
averages have been derived earlier in a different context; the details of the
derivation can be found in [4,1]:

$$\langle K_i \rangle = N \binom{M}{i} \frac{1}{N^i} \left(1 - \frac{1}{N}\right)^{M-i}. \qquad (3)$$

As an interesting detail of the solution, the average number of filled urns is
given by the total number of urns minus the number of empty ones,
$N^* = N - \langle K_0 \rangle$, i.e. [1],

$$N^* = N \left[ 1 - \left(1 - \frac{1}{N}\right)^{M} \right]. \qquad (4)$$

Obviously, for small M (number of balls) there is a significant number of urns
which, on average, stay empty. Translating back to the language of biology
we come to a surprising result: given a population of N = 1000 species, if we
investigate M = 5000 individuals, from Eq. (4) we obtain $N^* \approx 993.3$,
i.e., about 7 species are never found, although from naive reasoning one
expects each species to occur about 5 times.
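
The numbers quoted above can be checked with a short calculation. The sketch
below assumes that the expected occupation numbers take the binomial form
N * C(M, i) * (1/N)^i * (1 - 1/N)^(M - i), which is the expectation for M balls
thrown independently into N equiprobable urns, and evaluates it in log space so
that large M does not overflow. The function names are hypothetical and are
not taken from the program source.

```python
import math

def expected_occupancy(N, M):
    """Return [<K_0>, ..., <K_M>]: the expected number of urns that hold
    exactly i of M balls when each ball falls into one of N urns with equal
    probability (binomial form, evaluated in log space)."""
    if N < 2:
        raise ValueError("N must be at least 2")
    log_p = -math.log(N)             # log(1/N)
    log_q = math.log1p(-1.0 / N)     # log(1 - 1/N)
    result = []
    for i in range(M + 1):
        log_binom = (math.lgamma(M + 1) - math.lgamma(i + 1)
                     - math.lgamma(M - i + 1))
        result.append(N * math.exp(log_binom + i * log_p + (M - i) * log_q))
    return result

def expected_filled(N, M):
    """Average number of non-empty urns, N minus the expected number of
    empty ones."""
    return N * (1.0 - (1.0 - 1.0 / N) ** M)

K = expected_occupancy(1000, 5000)
n_star = expected_filled(1000, 5000)
print(round(K[0], 1))     # expected number of species never observed
print(round(n_star, 1))   # 993.3, the value quoted in the text
```

Two consistency checks follow directly from the definition: the occupation
numbers sum to N, and the sum of i times the i-th occupation number equals M.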

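The moments $\langle K_i \rangle$ given in Eq. (3) allow us to reconstruct the
rank ordered frequency distribution since they describe how many events, on
average, do not occur (zero times), how many occur once, twice, etc. Hence,
the desired rank ordered frequency distribution finally reads

$$M_i^{\mathrm{theo}} = \begin{cases}
0 & \text{for } N \ge i > N - \langle K_0 \rangle \\
1 & \text{for } N - \langle K_0 \rangle \ge i > N - \langle K_0 \rangle - \langle K_1 \rangle \\
\vdots & \\
j & \text{for } N - \sum_{k=0}^{j-1} \langle K_k \rangle \ge i > N - \sum_{k=0}^{j} \langle K_k \rangle .
\end{cases} \qquad (5)$$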

We apply Eq. (5) to predict the frequency distribution which arises from an
equidistribution for different sample sizes M and compare with direct
numerical simulations, see Figs. 3 and 4. The predictions due to Eq. (5) agree
well with the numerical experiment.
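
A comparison of this kind can be sketched in a few lines. The code below is an
illustration under stated assumptions, not the kernel program of the tool: it
assumes the binomial form of the expected occupation numbers, fills the ranks
from the right (the last ranks receive frequency 0, the next ones frequency 1,
and so on), and compares the result with one simulated sample. The `offset`
parameter is a hypothetical addition: 0.0 compares integer ranks with the
real-valued boundaries literally, 0.5 uses rank midpoints instead.

```python
import math
import random

def expected_occupancy(N, M):
    """<K_i>, i = 0..M: expected number of classes seen exactly i times
    (assumed binomial form, evaluated in log space)."""
    log_p, log_q = -math.log(N), math.log1p(-1.0 / N)
    return [
        N * math.exp(math.lgamma(M + 1) - math.lgamma(i + 1)
                     - math.lgamma(M - i + 1) + i * log_p + (M - i) * log_q)
        for i in range(M + 1)
    ]

def predicted_rank_order(N, M, offset=0.0):
    """Predicted frequency for ranks 1..N (rank 1 = most frequent).

    Rank i receives frequency j when
    N - sum_{k<j} <K_k>  >=  i - offset  >  N - sum_{k<=j} <K_k>.
    """
    K = expected_occupancy(N, M)
    freqs = []
    j, cum = 0, K[0]
    for i in range(N, 0, -1):          # walk from the last rank to the first
        while j < M and N - cum >= i - offset:
            j += 1
            cum += K[j]
        freqs.append(j)
    freqs.reverse()
    return freqs

def simulate_rank_ordered(N, M, seed=0):
    """One sample of M equiprobable events over N classes, in rank order."""
    rng = random.Random(seed)
    counts = [0] * N
    for _ in range(M):
        counts[rng.randrange(N)] += 1
    return sorted(counts, reverse=True)

N, M = 100, 1000
theory = predicted_rank_order(N, M)
sample = simulate_rank_ordered(N, M)
mean_abs_dev = sum(abs(a - b) for a, b in zip(sample, theory)) / N
print("predicted:", theory[:5], "...", theory[-5:])
print("simulated:", sample[:5], "...", sample[-5:])
print("mean absolute deviation per rank:", round(mean_abs_dev, 2))
```

Because the expected number of empty classes is always positive, the literal
reading (`offset=0.0`) assigns frequency 0 to the last rank even when that
expectation is tiny; the midpoint convention avoids this at the price of
departing from the literal rule.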

Fig. 3. Rank ordered frequency distribution for N = 100 equally distributed
events for different sample size M. The solid lines show the distribution as
predicted by Eq. (5), the dashed lines show the distribution of independently
drawn equidistributed random numbers from the interval [1, N].

Fig. 4. Same as Fig. 3 but for larger sample size M.

Exploration of experimental data

The theoretical distribution of frequencies due to Eq. (5) can be compared with
experimentally obtained frequencies. From the distance between both (rank
ordered) frequency distributions we can conclude whether the experimental
data obey an equidistribution. To this end we have developed a web based tool
(http://bioinf.charite.de/equifreq/). The user interface offers four alternative
input masks which differ in the way the input file is generated:

(1) The measured frequencies $M_i$ of each species are given directly.

(2) The number of species N and the total number of individuals M are 
specified. Each individual is assigned a species by chance. 

(3) As for (2) the rank ordered frequencies are computed but with the gen- 
eralization that each species is assigned an individual probability. The 
theoretical basis for this computation is not given here but will be pub- 
lished elsewhere [5]. 

(4) The last input mask is intended for the investigation of the spatial
distribution of point mutations in genes, which is presently the most
specialized application of the described program.

The program computes the expected frequency distribution due to Eq. (5)
with the assumption that the species obey an equiprobability distribution.
Three output files are generated: freq, ktheo and kexp. The file freq contains
the rank ordered frequencies as generated from the input data set (cases (1)
and (4)) or randomly due to an equiprobability distribution (case (2)) or a
general distribution (case (3)). ktheo contains the moments
$\langle K_i \rangle$ for each i, i.e. the expected number of species
occurring i times, due to Eq. (3) for given numbers N of species and M of
individuals. For cases (1) and (4) the values of M and N are extracted from
the input data, for (2) and (3) they are provided by the user. (Note that
these expectation values are real numbers in general.) The third column of
line i contains the value $M - \sum_{j=0}^{i} \langle K_j \rangle$. The last
file, kexp, contains the same data as ktheo, but based on the input data
(cases (1) and (4)) or on the randomly generated data (cases (2) and (3)),
respectively. Besides the pure output files the program generates a number of
visualizations (see section Example: Distribution of point mutations in genes).
In order to compare the experimental data with the mathematical prediction,
both the experimental data and the theoretical data are plotted in the same
chart. Congruence of both curves indicates that the experimental data obey an
equidistribution (case (2)) or the specified distribution (case (3)), respectively.

It may occur that the curve of the rank ordered experimental data decays
significantly more slowly than the corresponding theoretical curve due to
Eq. (5). Since there is no distribution more homogeneous than the
equidistribution, this situation may occur as a rare fluctuation (recall that
the theoretical curve was generated according to the averaged occupation
numbers, Eq. (3)). In such cases there is no probability distribution
$\{p_i\}$ which reproduces the experiment on average. This case can be
artificially evoked when the species in the input file occur with almost
identical frequencies.

The difference between the experimental rank ordered frequency distribution 
and the corresponding theoretical distribution (Eq. (5)) evaluates the degree 
of coincidence of the input data with an equidistribution (case (1)) or with a 
specified distribution (case (3)). We define the score by 

$$S = \sum_{i} \left| M_i - M_i^{\mathrm{theo}} \right| . \qquad (6)$$

The significance of a particular difference score can be assessed by relating it 
to the distribution of difference scores. This distribution depends on M and 
N. 
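
One way to obtain the distribution of difference scores for given M and N is
direct simulation. The sketch below is an assumption about how such an
assessment could be carried out, not a description of the tool: it takes the
score as the plain sum of absolute differences between the measured and the
predicted rank ordered frequencies (without any normalization), assumes the
binomial form of the expected occupation numbers, and estimates an empirical
p-value from equidistributed random samples. All function names are
hypothetical.

```python
import math
import random

def expected_occupancy(N, M):
    """<K_i>, i = 0..M, for N equiprobable classes (assumed binomial form)."""
    log_p, log_q = -math.log(N), math.log1p(-1.0 / N)
    return [
        N * math.exp(math.lgamma(M + 1) - math.lgamma(i + 1)
                     - math.lgamma(M - i + 1) + i * log_p + (M - i) * log_q)
        for i in range(M + 1)
    ]

def predicted_rank_order(N, M):
    """Predicted frequency for ranks 1..N, filled from the last rank upwards."""
    K = expected_occupancy(N, M)
    freqs = []
    j, cum = 0, K[0]
    for i in range(N, 0, -1):
        while j < M and N - cum >= i:
            j += 1
            cum += K[j]
        freqs.append(j)
    freqs.reverse()
    return freqs

def difference_score(ranked, theory):
    """Sum of absolute differences between two rank ordered lists."""
    return sum(abs(a - b) for a, b in zip(ranked, theory))

def empirical_p_value(observed, trials=200, seed=0):
    """Return (score, fraction of equidistributed samples with a score at
    least as large as the observed one), for the same N and M."""
    N, M = len(observed), sum(observed)
    theory = predicted_rank_order(N, M)
    s_obs = difference_score(sorted(observed, reverse=True), theory)
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        counts = [0] * N
        for _ in range(M):
            counts[rng.randrange(N)] += 1
        if difference_score(sorted(counts, reverse=True), theory) >= s_obs:
            hits += 1
    # add-one correction keeps the estimate strictly positive
    return s_obs, (hits + 1) / (trials + 1)

# Extreme toy input: all 100 events fall into one of 20 classes
clustered = [100] + [0] * 19
s, p = empirical_p_value(clustered)
print("score:", s, "empirical p-value:", round(p, 4))
```

A small p-value means that equidistributed samples with the same M and N
rarely produce a score as large as the observed one. The number of trials
limits the resolution of the estimate.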



Example: Distribution of point mutations in genes 



The increasing number of known point mutations and polymorphisms in many 
genes coding for pathogenetically important proteins offers the opportunity 
to apply statistical tests to correlate their type and location to evolutionary, 
biological and clinical features. 
Mutations of the genome occur in each replication generation, but frequently
they remain unnoticed since they do not cause diseases. These so-called
polymorphisms or variants may occur either in regions of the genome
which are coding for amino acid sequences or in non-coding segments. Those
changes of the DNA sequence that alter the amino acid sequence are frequently
associated with diseases because the respective proteins cannot operate
properly. Screenings for mutations using DNA of patients have been performed
for many human diseases and the identified mutations are accessible in
mutation databases [6].

The detection of so-called mutation hot spots, i.e. sequence regions with many
mutation positions, is important for the identification of the functional and
genetic properties of the genetic code [7]. These hot spots must be
distinguished from statistical fluctuations that occur even when the
probabilities for mutations are identical for each residue position. Moreover,
the spatial distribution of point mutations in genes is of importance for the
localization of coding and non-coding parts in the genome.

We wish to apply the described method to the investigation of the amino
acid sequence of the cystic fibrosis transmembrane conductance regulator.
The unperturbed gene (wild type) is given as a sequence of 1480 letters:
MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLELSDIYQIPSVDSADNLSEKLER...,
each standing for one amino acid [8]. In experiments a large number of
mutations, i.e., deviations from this sequence, has been observed. Such
mutations are available from databases, e.g. [6].

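[Figure: the beginning of the wild-type sequence
(MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLELSDIYQIPSVDSADNLSEKLER...) with the observed
mutations marked by underbraces at the affected positions. The annotated codes
include M1V, M1K, S4X, P5L, S10R, S13F, W19C, W19X, G27X, R31C, Q39X, S42F,
A46D and S50P.]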

The codes on top of the underbraces stand for the found mutations, e.g., P5L 
means that at position 5 it has been found that the amino acid proline (P) 
was replaced by leucine (L). 
We subdivided the sequence into 74 parts of equal length 20 and counted the
number of point mutations in each part. This way we obtain the measured
frequencies $M_i = \{2, 2, 4, 5, 4, 5, 2, 2, 2, 3, 2, 1, 0, 0, 1, 4, \dots\}$
which serve as input data. (The subdivision into parts may be repeated with a
different starting point which yields similar results.) Certainly, measured
frequencies as small as given above do not allow for the application of the
$\chi^2$ test. The measured frequencies are shown in Fig. 5. Obviously, based
on this data it is not possible to decide a priori whether the frequencies are
equidistributed.

Fig. 5. Observed numbers of point mutations. The sequence has been subdivided
into 74 parts of length 20.
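
The binning step can be sketched as follows. The code assumes that mutations
are given in the notation explained above (wild-type letter, position,
replacement letter, e.g. P5L) and counts them in consecutive intervals of
length 20. The function name, the `unique_positions` switch and the choice of
example codes (a small subset, not the full data set) are illustrative
assumptions; the input syntax of the web tool itself is not reproduced here.

```python
import math
import re

# wild-type letter, 1-based position, replacement letter, e.g. "P5L"
MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")

def bin_mutations(codes, seq_length, bin_size=20, unique_positions=False):
    """Count point mutations per interval of bin_size residues.

    With unique_positions=True several mutations at the same position are
    counted only once.
    """
    positions = []
    for code in codes:
        match = MUTATION.match(code)
        if match is None:
            raise ValueError("unrecognised mutation code: " + code)
        positions.append(int(match.group(2)))
    if unique_positions:
        positions = sorted(set(positions))
    counts = [0] * math.ceil(seq_length / bin_size)
    for pos in positions:
        if not 1 <= pos <= seq_length:
            raise ValueError("position outside the sequence: %d" % pos)
        counts[(pos - 1) // bin_size] += 1
    return counts

# A few mutation codes from the first 50 residues, for illustration only
codes = ["M1V", "M1K", "S4X", "P5L", "S10R", "S13F", "W19C", "W19X",
         "G27X", "R31C", "Q39X", "S42F", "A46D", "S50P"]
counts = bin_mutations(codes, seq_length=1480)
print(len(counts), counts[:3])   # 74 [8, 3, 3]
```

The resulting list of counts corresponds to the measured frequencies that
serve as input for the comparison with the predicted rank ordered distribution.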



After processing the data as described above we obtain the rank ordered
measured distribution (bars in Fig. 6). The full line shows the expected
(theoretical) frequency distribution due to Eq. (5) which has been generated
with the hypothesis that the positions of the point mutations are
equidistributed. Both curves deviate significantly from each other; therefore,
we conclude that the mutations are not equidistributed. This conclusion agrees
with the hypothesis in ref. [9].

Fig. 6. Results of the computation. The bars show the rank ordered frequencies,
the line displays the expected frequency distribution which would be obtained if
the point mutations were (equally) randomly distributed. The curves differ
significantly; therefore, we conclude that the point mutations are not random.

Since the investigation of point mutation is an interesting field of application 
of the program we developed a separate input mask for this purpose (case (4) 
of the list in the previous section). The input syntax for this mode is described 
in detail in the online help file of the program. 

Recently, it has been shown for point mutations in the human androgen recep- 
tor (AR) that the severity of the disease correlates with the local sequence con- 
servation [11]. Germline mutations in the gene of the androgen receptor lead 
to the androgen insensitivity syndrome (AIS). In addition it was found that 
somatic point mutations associated with prostate cancer are more frequently 
found at locations with higher sequence variation compared to germline mu- 
tations leading to complete AIS. The related prediction method SIFT [10] 
has been proposed recently. Both methods, SIFT and the method used in [11] 
are based on the alignment of a large number of related proteins. Inspired 
by their observation we asked the question whether mutations in the andro- 
gen receptor are distributed randomly over the sequence depending on the 
association with AIS or prostate cancer. The disease-associated mutations in 
the AR were obtained from the AR gene mutation database [12]. Multiple 
mutations at identical positions were counted only once. Those mutations re- 
sulting in single amino acid substitutions were included in the analysis. The 
test was performed for 61 mutations associated with prostate cancer and 86 
mutations found in patients with complete AIS. To perform the analysis we 
divided the sequence of 919 amino acids into 46 intervals of length 20 and 
counted the number of mutations in each interval. As expected, the results for 
the two datasets were different: Cancer associated mutations are more dissem- 
inated than congenital mutations found in patients with AIS. For mutations 
associated with prostate cancer the bar chart of the rank ordered frequencies 
nearly follows the theoretical curve for equal probabilities (Figs. 7, 8) whereas 
for AIS associated mutations the bar chart deviates markedly from the theo- 
retical curve. Based on this finding we hypothesize that mutagenesis in the 
germline is followed by a selection process so that only a portion of the muta- 
tions are found in patients while others lead to early embryonal or fetal death. 
Conversely, mutations associated with prostate cancer may persist and are 
recorded. 



Fig. 7. Distribution of missense mutations in the androgen receptor for germline
mutations leading to AIS. The height of the bars reflects the rank ordered
frequencies of mutations in sequence intervals of length 20. The thick line
displays the expected frequencies which would be obtained if point mutations
were randomly distributed.

Fig. 8. Distribution of missense mutations in the androgen receptor for somatic
mutations associated with prostate cancer. Explanation see caption of Fig. 7.

Conclusions

For small sample sizes the relative frequencies $M_i/M$ of occurrence of
individuals of a certain species i deviate significantly from the probabilities
of occurrence $p_i$. With the assumption that the N species occur with equal
probability $p_i = 1/N$, the expectation values $\langle K_j \rangle$ of the
numbers of events which are contained j times (j = 0, ..., M) in a sample of M
individuals can be determined based on combinatorial algebra. These expectation
values allow for a prediction of the rank ordered frequency distribution.

For many practical problems the amount of available data is insufficient to
employ standard tests, such as $\chi^2$, to discriminate whether or not a
certain set of events complies with an equiprobability distribution. For such
situations, which occur frequently in the biological sciences, we have
developed an online tool which is available at
http://bioinf.charite.de/equifreq/. As demonstrated for the case of point
mutations in the sequence of amino acids of the cystic fibrosis transmembrane
conductance regulator and the androgen receptor, even for sample set sizes
which are certainly not sufficient to decide this question directly from the
observed frequencies (see Figs. 3, 4) this tool helps to make a reliable
statement.

The proposed method may be generalized to arbitrary probability distributions
provided there exists a hypothesis on the functional form of the distribution
[13]. For mathematical reasons (see [5]), however, it is more difficult to
derive a formula equivalent to Eq. (5) for non-equiprobability distributions;
this is the subject of current research.



Availability and requirements

• Project name: equifreq 

• Project home page: http://bioinf.charite.de/equifreq/ 

• Operating systems: platform independent 

• Programming language: C++ 

• Other requirements: none 

• License: GNU GPL 

• Any restrictions to use by non-academics: none 